Scientific Progress


Foreword 

Geocoding is the process of determining the physical location that corresponds to a location identifier such as a postal code,
zip code, address or place name. This enables the visualization of each location on a geographic map. The importance of the
process is demonstrated by the variety of geo-coders that, given a postal address, can output geo-coordinates. The Texas A&M,
Google, and TeleAtlas geo-coders are just a few examples. Postal and zip codes are unique, but place names are not, which
makes toponym resolution an area of research. The extent of the ambiguity for place names is demonstrated by these statistics
from Leetaru:

“  ...the  United  States  has  a  40%  overlap  in  geographic  names  while  the  United  Kingdom  has  a  9.7%  overlap.  The  average  for 
all  countries  is  31%  and  when  all  locations  worldwide  are  pooled  together,  33%  of  locations  share  their  name  with  another 
location  somewhere  else  in  the  world.  Northern  Africa  and  the  Middle  East  have  fairly  low  levels  of  name  duplication,  while 
North  and  South  America  and  South-Eastern  Asia  have  higher-than  normal  concentrations  of  name  duplication,  suggesting  a 
higher  level  of  potential  error  in  geocoding  locations  there.” 

List  of  Tables  and  Illustrations 

Single  attachment  in  a  Microsoft  Word  document  consisting  of  both: 

Fig. 1: Gold standard Twitter data for scoring the geo-coding algorithm

Table 1: Performance comparison of our algorithm and a state-of-the-art algorithm on the Spatial ML text data


Problem studied

We  looked  at  toponym  resolution  in  text  and  microtext  (tweets)  for  the  purpose  of  making  geographical  maps  to  visualize  the 
data. 

Research question 1: Can we find the candidate in the gazetteer that is the correct match for each toponym in text?

Research question 2: Can we find the candidate in the gazetteer that is the correct match for each toponym in microtext? This
is an unstudied problem.


Method  to  solve  the  problem 

We  first  studied  the  identification  of  location  words  in  text  and  microtext  (Non-Geo/Geo  disambiguation),  and  then  we  identified 
which  physical  location  should  be  associated  with  which  location  word  (Geo/Geo  disambiguation)  because  many  places  in  the 
world  have  the  same  name. 


Geo-parsing  as  pre-requisite  to  geo-coding 

We developed an algorithm to identify location words in text or tweets as part of a DARPA grant, Network Sciences Division,
STTR A013015, Social Media in Strategic Communications Research. This algorithm relies on both string matching with a
gazetteer and machine learning to determine, based on sentence syntax, which words are location words. Our geo-parser for
the English language was used both to collect the data for the geo-coding problem and as the first step of geo-coding.

Geo-coding 

Our  geo-coding  algorithm  uses  machine  learning  to  associate  each  location  word  extracted  by  the  geoparser  with  candidate 
location  words  in  the  gazetteer,  and  to  determine  which  gazetteer  candidate  is  the  most  likely  match  with  the  word  extracted. 
Classification  algorithms  are  similar  if  not  identical;  it  is  the  features  used  for  the  classification  that  make  the  algorithm  perform 
adequately  or  optimally. 

Classification  features  for  geo-coding 

Whether  for  text  or  tweet,  geo-coding  uses  features  of  two  types:  those  that  indicate  which  gazetteer  candidate  is  more 
probable,  and  those  taken  from  text  or  tweet  context  that  lend  evidence  to  determine  which  physical  location  is  signified  by  the 
location  word  extracted.  Some  of  our  features  derive  from  heuristics  known  to  be  effective  for  geo-coding,  as  compiled  in 
Leidner  (2008). 

We use three gazetteer features to disambiguate toponyms in text and tweets. When numerous gazetteer candidates are
possible matches, we prefer the candidate that has a larger population, that has more alternate names in other
languages, and that is higher in the geo-spatial hierarchy.
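
The three preferences above can be illustrated with a minimal sketch. It is not our SVM-based geo-coder: it simply orders candidates by the three features in turn, which is an assumption made for illustration. The Candidate class, the hierarchy_level encoding, and the sample values are all hypothetical and are not taken from any gazetteer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical gazetteer candidate carrying the three features."""
    name: str
    population: int
    alternate_name_count: int
    hierarchy_level: int  # assumed encoding: 0 = country, 1 = state, 2 = city


def preference_key(candidate):
    # Larger population first, then more alternate names, then a higher
    # position in the hierarchy (a smaller level number, hence the minus).
    return (
        candidate.population,
        candidate.alternate_name_count,
        -candidate.hierarchy_level,
    )


def rank_candidates(candidates):
    """Return candidates from most to least preferred."""
    return sorted(candidates, key=preference_key, reverse=True)


# Placeholder values for illustration only.
candidates = [
    Candidate("Springfield (placeholder A)", 30_000, 2, 2),
    Candidate("Springfield (placeholder B)", 150_000, 5, 2),
    Candidate("Springfield (placeholder C)", 150_000, 9, 2),
]
ranked = rank_candidates(candidates)
print([c.name for c in ranked])
```

In a learned model these three values would be inputs to the classifier rather than a fixed sort order, so a strict ordering like this one is only a baseline.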

Context features for text and tweets provide additional clues as to what part of the world is being discussed, so that the
toponym is resolved correctly. This rests on the assumption of spatial autocorrelation: the locations mentioned in a particular
context are not independent, but are instead spatially correlated.


For text, the context consists of words in the same sentence or paragraph, and the entire document. For tweets, the context is
the other fields in the raw JSON file that accompanies each tweet as distributed by Twitter. Relevant fields are time_zone
(which is entered automatically by Twitter), the user_location and user_description fields, which the user enters upon
creating a Twitter account, and the GPS location of the tweet if the user has opted to include cell-phone coordinates with
each tweet. Another way to gain insight into the geographic context of a particular tweet is to consult the geographic history of
the user’s past tweets. The memory required to hold that history, however, would be enormous, so we do not implement this
idea.
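
As a rough illustration of how the tweet context could be gathered, the sketch below pulls the fields named above out of one tweet record. The nesting (a "user" object holding "time_zone", "location" and "description", and a top-level GeoJSON "coordinates" point) is an assumption about the JSON layout, and the sample record is invented.

```python
import json


def context_fields(tweet):
    """Collect the context fields of one tweet given as a dict."""
    user = tweet.get("user") or {}
    coords = tweet.get("coordinates") or {}
    point = coords.get("coordinates")  # assumed GeoJSON order: [lon, lat]
    return {
        "time_zone": user.get("time_zone"),
        "user_location": user.get("location"),
        "user_description": user.get("description"),
        # Reorder to (lat, lon); None when the user did not share GPS.
        "gps": (point[1], point[0]) if point else None,
    }


# Invented sample record, for illustration only.
raw = """
{
  "text": "Good morning from the lake",
  "user": {
    "time_zone": "Central Time (US & Canada)",
    "location": "Minneapolis",
    "description": "Coffee and maps"
  },
  "coordinates": {"type": "Point", "coordinates": [-93.0, 45.0]}
}
"""
fields = context_fields(json.loads(raw))
print(fields)
```

Any field the user left blank comes back as None, so downstream features need to tolerate missing context.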


Data  set(s)  to  study  the  problem 

Text. The Linguistic Data Consortium releases data sets to test algorithms. Spatial ML is a data set of news articles in which
toponyms are identified, along with their latitude and longitude. We used version 2, which eliminates some errors and
inconsistencies in the original version. Some sentences have toponyms or location adjectives (see “Canadian” in the
passage below), and these have been hand-assigned a latitude and longitude; altogether, 4655 toponyms have a valid latitude
and longitude. Control sentences have no toponyms. The limited number of ambiguous toponyms in this data set is a source
of bias.

Annotations  for  the  text.  Here  is  an  excerpt  from  the  data: 

Actually,  it  would  not  be  advisable  to  accept  Freddy's  comments  at  face  value.  It  is  not  acceptable  policy  and  not  legal  to  pay  an 
honorarium or stipend to directors of charities. I checked with a lawyer familiar with <PLACE country="CA" form="NAM"
gazref="IGDB:19437937" id="PI-14" latLong="60.000°N 96.000°W" predicative="true" type="COUNTRY">Canadian</PLACE>
charity  law,  who  explained  that  it  is  not  permitted  to  remunerate  board  members  of  charities  for  their  service  as  board  members. 
The  prohibition  does  not  apply  to  directors  of  nonprofit  organizations  that  are  not  charities. 

We  remove  the  <  >  tags  for  training  and  testing,  because  they  contain  the  annotations  we  use  for  scoring. 
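
One way to separate the annotations from the running text is sketched below. The regular expressions and the function name split_annotations are our illustration, not the project code; the sketch assumes PLACE elements are never nested and that attribute values are double-quoted, as in the excerpt above.

```python
import re

# Assumes non-nested <PLACE ...>text</PLACE> elements.
PLACE_RE = re.compile(r"<PLACE\b([^>]*)>(.*?)</PLACE>", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def split_annotations(marked_up):
    """Return (plain_text, gold), where gold lists each PLACE annotation."""
    gold = []

    def keep_text(match):
        attrs = dict(ATTR_RE.findall(match.group(1)))
        gold.append({"text": match.group(2), **attrs})
        return match.group(2)  # keep the toponym, drop the tag

    plain = PLACE_RE.sub(keep_text, marked_up)
    return plain, gold


sample = (
    'a lawyer familiar with <PLACE country="CA" form="NAM" '
    'latLong="60.000°N 96.000°W" type="COUNTRY">Canadian</PLACE> charity law'
)
plain, gold = split_annotations(sample)
print(plain)
print(gold[0]["latLong"])
```

The plain text goes to the geo-parser and geo-coder, while the gold list is held back for scoring.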

Microtext. As there is no tweet set that has toponyms annotated along with their latitude and longitude, we are annotating a
tweet set ourselves. We developed a toponym-rich data set by selecting tweets from January 1, 2013 to April 1, 2013,
downloaded from the Carnegie Mellon University archive. We used JSON files for tweets that had GPS coordinates. We
prepared the data by geo-parsing the tweet text field and, separately, the user_location and user_description fields for
toponyms, as well as the place fields. The place fields are the country and city name, the url for the city, and the coordinates
for the bounding box of that city, as resolved automatically from the IP address from which the user entered the tweet. One
annotator found 1678 entries for toponym1, the toponyms found in tweet text.

Annotations for the microtext. We gave the annotators tweets that had been pre-annotated by our geo-parser. That reduced
the workload for each annotator since, instead of supplying annotations for each tweet, it was necessary in many cases only to
verify that the geo-parser output was correct, or to correct the geo-parser output if wrong. The annotators were also asked to
add the geo-coordinates for each entry in the toponym1 field (location words in tweet text) and in the toponym2 field (location
words in user_location, user_description and time_zone).

The annotation project took months, as the coding is time consuming and the coders were asked to stop their work rather than
continue when tired. The same was true for the adjudication of the toponyms. A data set with mistakes will train the algorithm
to make mistakes, so a gold standard data set is critical. As a result, this part of the project is ongoing.

We gave annotators a definition of what constitutes a location, and a spreadsheet with toponyms to code for the tweet_text
field and for the other tweet fields that might contain toponyms. Two people coded the same tweets independently. We then
had a third person adjudicate and make the final decision as to what the coding should be. Fig. 1 shows an example of the
data with the Toponym1 annotations for training and then scoring the algorithm.


Results 

Our score for toponym resolution in text surpassed state-of-the-art results on the same data set. Lieberman et al. (2012)
reported a precision of .99 and a recall of .70 on Spatial ML, with an F1 of .82. Our precision was .94 and our recall was .95,
giving us an F1 of .94 (Table 1).
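
The F1 values quoted above follow from the reported precision and recall through the standard harmonic-mean formula. The short check below only reproduces that arithmetic; the function name is ours.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the text.
lieberman_f1 = f1_score(0.99, 0.70)
ours_f1 = f1_score(0.94, 0.95)

print(round(lieberman_f1, 2))
print(round(ours_f1, 2))
```

Because F1 is a harmonic mean, the low recall of .70 pulls the first score down far more than the high precision of .99 can lift it.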

Preliminary geospatial error analysis shows that only .07% of algorithm output was resolved to the wrong country, .4% was in
the correct country but the wrong state, and .07% was imprecise while in the correct country and state.


We are updating the code, and these statistics might change somewhat. We do not yet have the scores for toponym resolution
in tweets because the tweet annotations are still being completed.


Summary  of  Most  Important  Accomplishments 

(1)  End-to-end  system  to  geo-parse  and  geo-code  text  in  English 

(2)  End-to-end  system  to  geo-parse  and  geo-code  tweet  text  in  English 

(3)  Creation  of  a  data  set  that  can  be  used  to  test  tweet  geo-coding 

(4)  Script  to  score  precision  and  recall  for  the  geo-coding  algorithm 

(5)  Script  to  evaluate  geospatial  error  for  the  geo-coding  algorithm 


(1-2) End-to-end system. The end-to-end system comprises a geo-parser and a geo-coder that disambiguates toponyms. The
geo-parser was discussed in an earlier paper (Gelernter and Zhang, 2013), and is now in version 2.0.3. Strong toponym
disambiguation results (that is, accurate geo-coding) are contingent upon strong results in identifying toponyms, or
geo-parsing. The geo-coder uses the geo-parser output as input to find the best gazetteer match. The geo-coder is based on
a machine learning algorithm (SVM) that determines, among the possible matches for a given toponym, which gazetteer
toponym is the most likely match. A score based on the composite of the learning features is assigned to each gazetteer
candidate. The candidate with the highest score is the algorithm output, along with the distance between the two candidates in
kilometers and the top candidate’s country and state/province.
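
The selection step can be sketched as follows, under two assumptions that are ours: that each candidate already carries a composite score, and that the reported distance is the great-circle distance between the two highest-scoring candidates. The haversine formula and the mean earth radius of 6371 km are standard; the function names and sample scores are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def pick_best(scored):
    """scored: list of (score, lat, lon, label) tuples, at least one entry.

    Returns the top-scoring candidate and its distance in km to the
    runner-up (None when there is only one candidate).
    """
    ranked = sorted(scored, key=lambda c: c[0], reverse=True)
    best = ranked[0]
    if len(ranked) < 2:
        return best, None
    second = ranked[1]
    return best, haversine_km(best[1], best[2], second[1], second[2])


# Made-up scores for two placeholder candidates.
best, gap_km = pick_best([
    (0.35, 0.0, 1.0, "candidate B"),
    (0.80, 0.0, 0.0, "candidate A"),
])
print(best[3], round(gap_km, 1))
```

A large gap between the two leading candidates signals that a wrong choice would be costly in geographic terms, which is one reason to report it next to the winner.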

(3)  Data  for  testing.  Our  gold-standard  data  set  for  Twitter  (Fig.  1)  will  be  finished  within  a  few  weeks,  and  will  be  usable  by 
other  researchers,  provided  that  the  Twitter  company  imposes  no  legal  restrictions. 

(4) Algorithm to score geo-coding results. Geographic information retrieval is a subset of information retrieval and, as such, is
evaluated with the same recall and precision metrics (Martins et al., 2005). We have built a script that scores results, so that
we can measure algorithm improvements easily, and so that other users can measure their results.
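
A scoring script of this kind might look like the sketch below. It is not our script: it assumes that each toponym mention has an identifier and that a prediction counts as correct only when it names the same gazetteer entry as the gold standard. The identifiers shown are placeholders.

```python
def precision_recall(gold, predicted):
    """gold, predicted: dicts mapping a mention id to a gazetteer id.

    Precision = correct / number of predictions made.
    Recall    = correct / number of gold-standard toponyms.
    """
    correct = sum(
        1 for mention, entry in predicted.items() if gold.get(mention) == entry
    )
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Placeholder identifiers for illustration only.
gold = {"m1": "g100", "m2": "g200", "m3": "g300", "m4": "g400"}
predicted = {"m1": "g100", "m2": "g999", "m3": "g300"}

precision, recall = precision_recall(gold, predicted)
print(precision, recall)
```

Exact identifier matching is the strictest choice; a variant could instead accept a prediction whose coordinates fall within some tolerance of the gold coordinates.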

(5) Script to evaluate geo-coding results. The spatial element, if included in a geo-coding algorithm evaluation, is sometimes
given as a physical distance along the earth (for example, 366 km as the difference between the actual and expected
answer). However, the number of on-the-ground units between the actual and expected location is less important than whether
the geospatial hierarchy was preserved, since a relatively small offset in Europe might mean that the toponym was resolved to
the wrong country, a worse mistake than resolving to the right country but the wrong state. So while presenting our results in
the standard format of precision and recall, we also introduce a method of evaluation that fits the hierarchical nature of the
data better than typical distance-based methods.
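
The hierarchy-aware evaluation can be sketched as a classification of each output into the error levels used in the Results section. The record layout (country, state, lat, lon keys), the level names, and the sample values are assumptions made for this illustration.

```python
from collections import Counter


def error_level(predicted, gold):
    """Classify one geo-coding output against the gold standard."""
    if predicted["country"] != gold["country"]:
        return "wrong_country"
    if predicted["state"] != gold["state"]:
        return "wrong_state"
    if (predicted["lat"], predicted["lon"]) != (gold["lat"], gold["lon"]):
        return "imprecise"  # right country and state, wrong point
    return "correct"


def tally(pairs):
    """Count error levels over (predicted, gold) pairs."""
    return Counter(error_level(p, g) for p, g in pairs)


# Placeholder records for illustration only.
gold = {"country": "US", "state": "MN", "lat": 45, "lon": -93}
pairs = [
    ({"country": "US", "state": "MN", "lat": 45, "lon": -93}, gold),
    ({"country": "US", "state": "KS", "lat": 39, "lon": -98}, gold),
    ({"country": "CA", "state": "ON", "lat": 44, "lon": -79}, gold),
    ({"country": "US", "state": "MN", "lat": 44, "lon": -94}, gold),
]
counts = tally(pairs)
print(counts)
```

Checking the country before the state and the coordinates keeps the most serious error from being hidden by a small distance on the ground.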
Tweet text: Damn I miss Cebu City! Good morning Pinas!
Toponym1: tp{Pinas[14,121]}tptp{Cebu City[10,124]}tp

Tweet text: Ringing in the new year with two of my new best friends from 2012 at @republicMN in Minneapolis :)
Toponym1: tp{Minneapolis[45,-93]}tp

Fig. 1 Two tweets and the toponyms extracted by our geo-parser and verified by an annotator,
with the annotator-added geo-coordinates of each toponym included.
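
The Toponym1 annotation format in Fig. 1 can be read back into structured form with a small parser such as the sketch below. The pattern is inferred from the two examples in the figure, and the reading of the bracketed pair as [latitude, longitude] is our assumption based on those examples; the function name is hypothetical.

```python
import re

# One annotation looks like tp{Name[lat,lon]}tp; several may be concatenated.
TP_RE = re.compile(
    r"tp\{(.+?)\[(-?\d+(?:\.\d+)?),\s*(-?\d+(?:\.\d+)?)\]\}tp"
)


def parse_toponym_field(field):
    """Return a list of (name, lat, lon) tuples from one Toponym1 cell."""
    return [
        (name.strip(), float(lat), float(lon))
        for name, lat, lon in TP_RE.findall(field)
    ]


parsed = parse_toponym_field("tp{Pinas[14,121]}tptp{Cebu City[10,124]}tp")
print(parsed)
```

A cell with no annotations yields an empty list, so tweets without toponyms need no special handling.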
Table 1: Performance comparison on the Spatial ML data.


